Abstract We propose a model to analyze citation growth and influences of fitness
(competitiveness) factors in an evolving citation network. Applying the proposed 
method to modeling citations to papers and scholars in the InfoVis 2004 data, a 
benchmark collection about a 31-year history of information visualization, leads 
to findings consistent with citation distributions in general and observations of the 
domain in particular. Fitness variables based on prior impacts and the time factor 
have significant influences on citation outcomes. We find considerably large effect 
sizes from the fitness modeling, which suggest inevitable bias in citation analysis 
due to these factors. While raw citation scores offer little insight into the growth 
of InfoVis, normalization of the scores by influences of time and prior fitness offers 
a reasonable depiction of the field's development. The analysis demonstrates the 
proposed model's ability to produce results consistent with observed data and to 
support meaningful comparison of citation scores over time. 

Keywords Citation analysis • Normalized citation scores • Preferential attachment •
Fitness • Citation network • Scholarly impact • Information visualization



Introduction 

Citation frequency is a basic indicator of the use and usefulness of scientific
publications (Pritchard 1969). Citation analysis has been commonly used to evaluate
scholarly productivity and impact (Garfield 1972; Cronin and Overfelt 1994).
However, due to human subjectivity in citation behaviors and a wide spectrum of
factors involved in a scholar's decision to include a reference, citation frequency is
not an unambiguous quantity for objective evaluation of scholarly communications
(Nicolaisen 2007; Zyczkowski 2009). According to Garfield (1972), citation scores
are associated with many variables beyond scientific merit.

It has been recognized that citation growth is a process of cumulative advantage,
in which "success seems to breed success" (Price 1976, p. 292). Early players
are likely to dominate in gaining citations given the advantage of entry time.
According to Barabasi and Albert (1999), complex systems such as a citation network
demonstrate scale-free properties as a result of network growth with preferential
attachment. Specifically, when a node enters a scale-free network, it is more likely
to connect to (cite) those that have been more strongly connected (highly cited).
The scale-free model nicely explains distributions of connectivity that decay with a
power-law function, commonly observed in real-world networks such as the world
wide web and citation networks (Redner 1998; Amaral et al 2000b; Barabasi
2009; Matjaz and Perc 2010).

Despite its model simplicity and effectiveness in regenerating related distri- 
butions, the preferential attachment mechanism in the scale-free model only rep- 
resents partial reality. While many real-world connectivity distributions show a 
long tail, they are rarely perfect power-laws that are free of scales. Due to con- 
straints such as aging and limited capacities to receive new connections, certain 
categories of these networks demonstrate single-scale or broad-scale characteristics
(Amaral et al 2000a). Broad-scale structures such as co-authorship networks
demonstrate a power-law region followed by an exponential or Gaussian cutoff
because of individual capacities to collaborate.

In addition, preferential attachment alone does not sufficiently depict the real- 
ity given the common observation that competitive latecomers do have chances 
to break the loop and play important roles in growing network communities. In 
written communications, an article with great scientific merit may attract many
citations even if it is published late (Redner 1998). This recognition of
competitiveness, in addition to the time factor in preferential attachment, has triggered
research on new models in network science.



According to Bianconi and Barabasi (2001), a node's growth in connectivity in
a network depends on its fitness to compete for links. Fitter nodes have the ability
to overcome highly connected nodes that are less fit. In the fitness model, entry
time as well as factors associated with a node's competitiveness (fitness) account
for its ultimate connectivity. A fitness network demonstrates not only the
rich-get-richer effect (dominance of early players) but also the fitter-get-richer
phenomenon (opportunities for latecomers to surpass the established).

The ideas of network growth, preferential attachment, and fitness have impor- 
tant implications in citation analysis. We have acknowledged that raw citation 
count is not a fair vehicle for scholarly impact evaluation. Particularly, time is a 
factor that likely hinders meaningful comparison of papers published in different 
years. While one may suggest the use of yearly averages to normalize citation
scores for a comparative evaluation, research has clearly indicated that citation
growth is not a linear function of time (Robert G. Sumner 1995; Gupta 1997;
Barabasi and Albert 1999; Bianconi and Barabasi 2001; Zhu et al 2003). Isolating
the influence of time in citation analysis therefore requires close examination of
this relationship.

Furthermore, we reason that fitness is a very broad notion and, in the context 
of citation analysis, potentially represents a variety of constituent variables.
Understanding quantitatively how papers¹ compete for citations requires
a mathematical model in which the notion of fitness is integrated and can be
factorized into related variables in citation data.

In the light of preferential attachment and fitness, this research aims to build 
a simple, general model to quantify citations and analyze scholarly impacts in 
evolving citation networks. The model integrates time as well as related fitness
factors and offers a means to single out their contributions (bias) to
citation scores for comparative analysis. We conduct a case study to validate
the proposed model and to demonstrate its utility in the evaluation.



Proposed Model 

We present a fitness model to analyze citation distributions over time. The pur- 
pose of this modeling is to quantitatively offer insight into citation characteristics 
and evolving patterns in various domains. While its applicability can be verified 
with real data, the model will incorporate important variables and take into ac- 
count their relations in contributing to scholarly impact. By quantifying individual 
factors' contributions, we can estimate key parameters in the model and obtain 
important quantitative descriptors about the development of a domain in question. 

We describe the proposed model by introducing three key aspects in the analy- 
sis. It is apparent that the number of citations a paper has received reflects several 
factors in the following respects: 1) quality, merit, and contribution of research 
presented in the paper; 2) attractiveness of the paper due to existing influences 
of its authors and publication venue; and 3) age of the paper which allows for 
the accumulation of citations over time. We model citation-based scholarly impact 
using these (abstract) factors and elaborate on model formulation below. 



Citations over Time 

Research has identified some common patterns in how citations accumulate
over time. According to Gupta (1997), a citation decay curve consists of two parts:
an increase of citations during the first couple of years followed by a gradual decline
of citations as the paper gets older, as shown in Figure 1 (a). The cumulative
trend is illustrated in Figure 1 (b). Similar patterns have been found in related
studies such as Robert G. Sumner (1995) and Zhu et al (2003).



1 We use papers, articles and publications interchangeably. In the data used for this study,
a paper may refer to a research article, a book chapter, or a book.

Fig. 1 Citations c(t) vs. time t. Panels: (a) Citations; (b) Cumulative citations. (a) illustrates
a schematic citation distribution over time; (b) presents the cumulative distribution cc(t) and a
power-function approximation of the schematic trend. Compare to Gupta (1997).

Viewing papers as nodes and citations as directed arcs connecting the nodes,
network science research has provided important methods and tools to model
citation frequency based on connectivity probabilities in an evolving citation network.
The scale-free model, among others, provides insights into the basic mechanisms
behind power-law degree distributions commonly observed in a wide spectrum of
real networks. It models the outcome of such a distribution based on two
simultaneous processes, namely network growth and preferential attachment. The original
scale-free model proposed by Barabasi and Albert (1999) relies on a probability
function linear in existing connectivity. That is, the likelihood that a new node
connects to an existing node i (paper i) is proportional to node i's degree (the
number of citations paper i has already received). The increase of the i-th node's
degree $k_i$ over time can be computed from the probability $\Pi_i$ of a new node
connecting to the i-th node:

$$\frac{\partial k_i}{\partial t} = m \Pi_i = m \frac{k_i}{\sum_{j=1}^{N-1} k_j} \tag{1}$$

where m is the initial degree of each node upon its introduction at time $t_i$
and N is the total number of nodes in the network at time t. Solving the above
equation leads to:

$$k_i(t) = m \left( \frac{t}{t_i} \right)^{\beta} \tag{2}$$

with $\beta = 1/2$. Let $\tau_i = t/t_i$ denote the time factor; the above can be written as:

$$k_i(t) = m \tau_i^{\beta} \tag{3}$$

Given $\beta \in (0, 1)$, the function roughly approximates the schematic trend of
cumulative citations over time, as illustrated in Figure 1 (b). While there are other
citation aging functions such as those proposed by Burrell (2002) and Zhu et al
(2003), the scale-free model is very generalizable and has produced results
consistent with many real-world networks in power-law frequency distributions. For this
reason, we adopt the functional form of $\tau^{\beta}$ in the proposed model, where $\beta$ is to
be estimated. Although this is not necessarily the most precise method to model
citation growth, it does capture the decaying pattern of citations over time, as
shown in Figure 1 (b).
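
As a minimal sketch of the functional form adopted above, the following Python snippet evaluates cumulative citations of the form m times tau to the power beta at equally spaced values of the time factor. The function name `cumulative_citations` and the sample values of m and beta are assumptions chosen for illustration; they are not estimates from the data.

```python
def cumulative_citations(tau, m=1.0, beta=0.5):
    """Cumulative citations m * tau**beta for time factor tau = t / t_i (tau >= 1)."""
    if tau < 1:
        raise ValueError("tau = t / t_i must be at least 1")
    return m * tau ** beta


# With 0 < beta < 1 the curve keeps rising, but each equal step in tau adds
# less than the previous one, which mirrors the flattening cumulative trend.
taus = [1, 2, 3, 4, 5]
values = [cumulative_citations(tau, m=2.0, beta=0.5) for tau in taus]
increments = [later - earlier for earlier, later in zip(values, values[1:])]
print(values)
print(increments)
```

The shrinking increments correspond to the declining yearly citations in the schematic curve, while the cumulative total never decreases.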



Fitness Modeling 

So far, the model in Equation 3 is solely based on the time factor $\tau$, similar to
the original scale-free model. Incorporation of factors related to individual nodes'
competitiveness has led to new formulations such as the fitness model, in which
younger nodes with a higher fitness parameter can overcome the dominance of early
players (Bianconi and Barabasi 2001). We reason that a paper's fitness, in the
citation analysis context, is associated with scholarly establishment and scientific
merit of that written communication. This represents a dimension independent of
time, and its impact on citations manifests over time. Similar to parameter m in
Equation 3, the fitness factor potentiates a node's ability to attract citations and
can be measured at the moment of its introduction. Different from m, however,
fitness represents individual competitiveness and varies from one node to another.
By introducing fitness factor $\eta_i$ of node i to Equation 3, we obtain the following
fitness model for citation analysis:

$$k_i(t) \propto \eta_i \tau_i^{\beta} \tag{4}$$

where $\eta_i$ represents a collection of factors associated with a node's
competitiveness in the citation network and can be further factorized by these variables.
While Barabasi and Albert (1999) obtained $\beta = 1/2$ in the scale-free model, we
leave this to empirical validation in the data.

Suppose a number of factors contribute to a paper's fitness. Assume n factors,
$[\phi_{1,i}, \phi_{2,i}, \ldots, \phi_{n,i}]$, can be measured in the data while others are unknown and
denoted as $\epsilon_i$. Examples of $\phi$ variables include evidence about existing influences of
a paper's authors, e.g., how frequently the authors have been cited prior to the
paper's publication. These factors in a sense represent the introductory degree of
a node, similar in spirit to parameter m in Equation 3.

Seen in the light of a power-law degree distribution, there is a huge divide in
connectivity between highly cited nodes and those that are rarely connected to. To
integrate these degree-related factors in $\eta$ requires normalizing their values
(citation scores), which differ by orders of magnitude, to a reasonable scale. Log
transformation appears to be a reasonable step in the modeling. We propose:

$$\ln \eta_i = c + \sum_n \gamma_n \ln \phi_{n,i} + \delta \ln \epsilon_i \tag{5}$$

where $\gamma_n$ and $\delta$ represent weights of the contributing factors.
Equation 5 is equivalent to:

$$\eta_i = e^{c} \Big( \prod_n \phi_{n,i}^{\gamma_n} \Big) \epsilon_i^{\delta} \tag{6}$$

Replacing $\eta_i$ in Equation 4 with Equation 6, we get the final fitness model:

$$k_i(t) = a \, \eta_i \tau_i^{\beta} = \alpha \Big( \prod_n \phi_{n,i}^{\gamma_n} \Big) \tau_i^{\beta} \epsilon_i^{\delta} \tag{7}$$

where $\tau$ is the time factor, $\phi$ denotes factors about a paper's fitness in
terms of existing influences, and $\alpha = a e^{c}$ collects the constant terms. The
coefficients $\alpha$, $\beta$, and $\gamma_n$ can be estimated from
data. Let $\alpha' = \ln \alpha$ and $\epsilon'_i = \delta \ln \epsilon_i$. The above model is equivalent to the following
equation after logarithmic transformation, for which generalized linear regression
can be performed to estimate the coefficients.

$$\ln k_i(t) = \alpha' + \sum_n \gamma_n \ln \phi_{n,i} + \beta \ln \tau_i + \epsilon'_i \tag{8}$$
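
The log-linear form lends itself to a standard regression fit. The sketch below is only an illustration under stated assumptions: it generates synthetic data with made-up coefficients, uses two hypothetical fitness factors, and fits ordinary least squares with NumPy rather than whatever regression software was used for the reported estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_papers = 2000

# Synthetic predictors (all strictly positive so that logs are defined).
phi = rng.uniform(1.0, 50.0, size=(n_papers, 2))   # two hypothetical fitness factors
tau = rng.uniform(1.0, 10.0, size=n_papers)        # time factor

# Assumed "true" coefficients, used only to generate the toy data.
alpha_prime, gammas, beta = -0.5, np.array([0.3, 0.1]), 0.6
noise = rng.normal(0.0, 0.2, size=n_papers)        # plays the role of the residual term
ln_k = alpha_prime + np.log(phi) @ gammas + beta * np.log(tau) + noise

# Design matrix of the log-linear model: intercept, ln(phi_n), ln(tau).
X = np.column_stack([np.ones(n_papers), np.log(phi), np.log(tau)])
coef, *_ = np.linalg.lstsq(X, ln_k, rcond=None)
alpha_hat, gamma_hat, beta_hat = coef[0], coef[1:3], coef[3]
print(alpha_hat, gamma_hat, beta_hat)
```

With enough observations the fitted values land close to the coefficients used to generate the data, which is the sense in which the coefficients of the log-transformed model are estimable from citation records.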



Model Implications 

Based on the model presented in Equation 7, or equivalently Equation 8, we can
quantify proportionally various factors' contributions to a paper's scholarly impact
(citations). We have taken into account three categories of variables, namely $\tau$, the
time factor; $\phi$ variables about established influences prior to a paper's publication,
which are potentially measurable; and other unknown variables contributing to a
paper's fitness, summarized as $\epsilon$.

By singling out the individual variables and estimating related coefficients from
data, the model supports evaluation of scholarly impact at multiple levels. For
example, isolating the impact of time factor $\tau$ will enable examination of a paper's
fitness and fair comparison of papers regardless of their ages. In addition, suppose
in the data analysis we can include in $\phi$ an exhaustive set of variables about established
influences prior to a paper's publication (e.g., authors' prior impacts); then the $\epsilon$
variable is a surrogate of remaining factors about a paper's actual fitness. In this
case, quantifying $\epsilon$ will offer insight into a paper's own ability to attract citations
because of its scientific merit and contribution to the field, rather than due to
other prior, external factors.

Finally, we observe that the proposed model has the potential for causality
analysis or prediction of citation scores. In the model equation there is a time sequence
from right (independent variables) to left (dependent variable). Besides the $\tau$ factor, all
variables are about factors prior to or upon the publication of a paper. They can
be measured at the time of publication. A citation score $k_i$ can then be seen as the
result of these factors over the course of time $\tau$. Because of this time sequence, it
is plausible, though not in definitive terms, to infer a causal relationship between
predictor variables $\phi$ and ultimate citation scores.



Model Validation and Data Analysis 

We apply the proposed fitness model to a collection of 31 years' citation records in 
information visualization (InfoVis) to validate the model and to analyze evolving
patterns about the domain. In this section we describe the data, related variables 
used in the fitness model for paper citations, and a derived model for scholars in 
the analysis. We discuss results and insights from the analysis in the next section. 



Data 

The InfoVis data set is a collection of major publications in the emerging
information visualization field during 1974-2004, retrieved from the ACM Digital
Library. It was prepared by the IEEE Information Visualization Contest in 2004
to depict the early history of the field and made available as part of the InfoVis
Benchmark Repository (Ke et al 2004; Plaisant et al 2008).

According to Fekete et al (2004), the data set contains metadata of important
publications on information visualization collected from multiple venues and is
representative of the early development (emergence) of the field. The original data
is in the XML (Extensible Markup Language) format, which we convert to a
relational database. Figure 2 shows the database schema with major tables and
relationships.



Tables and fields in the schema (PK: primary key; FK: foreign key; I: index):

sources:          id (PK), source, type
article_authors:  article_id (PK, FK1), author_id (PK), author (I1), sequence
articles:         id (PK), title, source_id (FK1, I2), pages_from, pages_to, abstract, year (I1)
references:       citing_id (PK, FK1), cited_id (PK, FK2), reference, year
article_keywords: article_id (PK, FK1), keyword (PK)

Fig. 2 Data schema of the InfoVis 2004 collection



Each citation record has information such as a paper's title, authors, abstract,
keywords, source, references, number of pages, and the year of publication. One
paper (acm673478) has no author and is removed from the data. We perform author
name unification through automatic name normalization and manual correction.
The final data set contains 613 papers with 1,036 unique authors/scholars and
8,502 references to papers within and outside the set.

Yearly distributions of the number of publications, the number of references,
and the number of citations are shown in Figure 3. Because the focus of this study
is on scholarly impact and fitness factors based on citation scores, data fields related
to content such as title and abstract are not used in the modeling.



Fitness model for papers and related variables 

In terms of the fitness model in Equation 7, we identify related variables from
the InfoVis data for model validation and data analysis. We use generalized linear
regression in the equivalent form of Equation 8 to estimate related parameters and
to examine influences of identified variables. We use $k_i$ to denote the number of
citations a paper i received within the collection. Time $\tau_i$ represents the age of
paper i when data were collected in 2004. The InfoVis data also have evidence
about existing impacts prior to a paper's publication, denoted as $\phi$ variables:

1. Authors' prior impact factor $\phi_{a,i}$ refers to the number of citations authors of
paper i received (for earlier works) before the paper's publication.

2. Venue's prior impact factor $\phi_{v,i}$ is the average number of citations to (earlier)
papers at the venue where paper i was published before its publication.

3. References' prior impact factor $\phi_{r,i}$ denotes the number of citations to works
referenced by paper i prior to its publication.

Weimao Ke

Fig. 3 Yearly distributions of the InfoVis 2004 collection (x-axis: Year, 1974-2004). The number
of references refers to the number of references contained in papers published in a year, that is,
the number of times other works are referenced by papers published in that particular year,
divided by 10 to fit the plot. The number of citations denotes the number of times papers
published in a year are cited in following years, divided by 10 to fit the plot.

Note that all variables rely solely on records in the InfoVis collection. We do
not seek additional citation information about the collected papers from external
sources. Replication of the analysis reported in this article is straightforward. It
can be conducted on many other domains where representative citation records in
a given time period are available.
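
The three prior-impact variables can be derived from citation records alone. The following sketch shows one possible way to do so on a toy collection; the record layout, the identifiers, and the choice to date a citation by the citing paper's publication year are all assumptions made for illustration. How zero counts are treated before a log transformation is not covered here.

```python
from statistics import mean

# Hypothetical toy records: paper id -> (year, authors, venue).
papers = {
    "p1": (1990, ["A"], "V1"),
    "p2": (1992, ["A", "B"], "V1"),
    "p3": (1994, ["B"], "V2"),
    "p4": (1996, ["A", "C"], "V1"),
}
# (citing, cited) pairs within the collection.
references = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p2"), ("p4", "p3")]


def citations_before(paper_id, year):
    """Citations to paper_id made by papers published before `year`."""
    return sum(1 for citing, cited in references
               if cited == paper_id and papers[citing][0] < year)


def prior_impacts(paper_id):
    """Return (phi_a, phi_v, phi_r) for one paper, measured before its publication."""
    year, authors, venue = papers[paper_id]
    earlier = {pid: rec for pid, rec in papers.items() if rec[0] < year}
    # phi_a: citations received by the authors' earlier works.
    phi_a = sum(citations_before(pid, year) for pid, rec in earlier.items()
                if set(rec[1]) & set(authors))
    # phi_v: average citations to earlier papers at the same venue.
    venue_counts = [citations_before(pid, year) for pid, rec in earlier.items()
                    if rec[2] == venue]
    phi_v = mean(venue_counts) if venue_counts else 0
    # phi_r: citations to the works this paper references.
    phi_r = sum(citations_before(cited, year) for citing, cited in references
                if citing == paper_id)
    return phi_a, phi_v, phi_r


print(prior_impacts("p4"))
```

In this toy example each co-authored earlier work contributes its full citation count to the authors' prior impact; a fractional attribution would be an equally plausible reading.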



Derived model for scholars 

The proposed model has so far focused on the fitness of nodes in citation analysis,
where nodes represent papers. For an analysis from the perspective of scholars
(authors), a second model can be derived from Equation 7. We treat an author's
citation score from a paper as a fair share of the authorship. Using fractional counting,
we distribute the credit for a multi-authored work equally among its contributors
(Lindsey 1980; Lee and Bozeman 2005). That is, the citation score that each scholar
(author) j receives from paper i is $k_{s_j} = k_i / c_i$, where $k_i$ is the citation frequency of paper i and
$c_i$ is the number of contributors (authors) of the paper.
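
Fractional counting is straightforward to express in code. The sketch below uses a hypothetical list of papers and assumes that a scholar's total score is the sum of the equal shares received from each authored paper.

```python
from collections import defaultdict

# Hypothetical papers: (citation count k_i, list of authors).
papers = [
    (12, ["A", "B", "C"]),
    (6, ["A"]),
    (4, ["B", "C"]),
]

scholar_score = defaultdict(float)
for k_i, authors in papers:
    share = k_i / len(authors)      # k_i / c_i, equal credit per contributor
    for author in authors:
        scholar_score[author] += share

print(dict(scholar_score))
```

Because every paper's citations are split exactly once, the scholar scores add up to the total number of citations in the collection.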

One additional factor for modeling a scholar's citation frequency is the number
of papers he or she has authored, denoted as p. This being considered, factors
about authored papers' fitness need to be normalized (averaged) so that p is not
redundant to existing contributions in individual papers. We propose the use of
the geometric mean to average contributing variables $\tau$ and $\phi$ for each author. For
example, given a set of values $[v_1, v_2, \ldots, v_{p_j}]$ for variable v observed in $p_j$ papers
authored by scholar j, the geometric mean $\bar{v}_j$ is computed by:

$$\bar{v}_j = \Big( \prod_{i=1}^{p_j} v_i \Big)^{1/p_j} = \sqrt[p_j]{v_1 \cdot v_2 \cdots v_{p_j}} \tag{9}$$
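
A small sketch of the geometric mean, computed in log space to avoid overflow for long products. The function name and the sample values are illustrative assumptions.

```python
import math


def geometric_mean(values):
    """Geometric mean computed in log space; values must be positive."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


v = [1, 10, 100]
print(geometric_mean(v))     # close to 10
print(sum(v) / len(v))       # arithmetic mean, 37.0
```

For values spanning several orders of magnitude, as citation scores often do, the geometric mean is far less dominated by the largest value than the arithmetic mean.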

While the arithmetic mean is the common approach to averaging citation scores,
research has also adopted the harmonic mean and the geometric mean in citation
analysis (Glanzel and Schubert 1993; Sikorav 1991). Advantages of the geometric mean
include reduced variance and model simplicity when variables are
log-transformable. Using geometric means and the additional variable $p_j$, with weight $\kappa$, the fitness model
for scholar j according to Equation 7 can be written as:

$$k_{s_j}(t) = \alpha \Big( \prod_n \bar{\phi}_{n,j}^{\gamma_n} \Big) \bar{\tau}_j^{\beta} \, p_j^{\kappa} \, \epsilon_j^{\delta} \tag{10}$$

$$= \alpha \prod_n \Big( \big( \prod_{i=1}^{p_j} \phi_{n,i} \big)^{1/p_j} \Big)^{\gamma_n} \Big( \prod_{i=1}^{p_j} \tau_i \Big)^{\beta/p_j} p_j^{\kappa} \, \epsilon_j^{\delta} \tag{11}$$

$$= \alpha \Big( \prod_{i=1}^{p_j} \big( \prod_n \phi_{n,i}^{\gamma_n} \big) \tau_i^{\beta} \Big)^{1/p_j} p_j^{\kappa} \, \epsilon_j^{\delta} \tag{12}$$

Here $(\prod_n \phi_{n,i}^{\gamma_n}) \tau_i^{\beta}$ turns out to be part of the fitness model for paper i in
Equation 7. By using geometric means, the two models for papers and scholars are
tightly associated. This derived fitness model for scholars, shown in Equation 12,
can be seen as an aggregation of normalized individual papers' contributions toward
author citations.

Results 

We present model validation and analysis results from modeling the InfoVis data.
We focus on the fitness model for papers and discuss results from the derived model
for scholars (authors) as well. We also present results from additional analyses of
frequency distributions and multi-authorship impacts.

Model validation

Fitness model for papers

Based on the fitness model for papers in Equation 7, generalized linear regression of
the InfoVis data produces the coefficient estimates in Table 1. As the results indicate,
all estimates are statistically significant in the InfoVis data, where prior fitness $\phi$
and time $\tau$ variables contribute positively to paper citation frequencies.

The fitness model for papers based on estimates in Table 1 can be expressed
as Equation 13, where the growth of citations k over time $\tau$ follows the function
$\tau^{0.57}$, close to the scale-free model derivation (Barabasi and Albert 1999). While prior
impact factors $\phi$ all contribute to a paper's overall ability to attract citations,
authors' prior impact factor $\phi_a$ appears to have the greatest impact ($\gamma_a = 0.326$).

Table 1 Fitness model for papers in the InfoVis data

Coefficient                              Estimate   Std Error   t value   Pr(>|t|)
$\alpha'$                                -0.771     0.115       -6.72     4.3E-11
$\gamma_a$: author impact $\phi_a$       0.326      0.0199      16.3      5.5E-50    ***
$\gamma_v$: venue impact $\phi_v$        0.0814     0.0365      2.23      0.026      *
$\gamma_r$: reference impact $\phi_r$    0.0395     0.0135      2.93      0.0035     **
$\beta$: time factor $\tau$              0.573      0.048       11.9      1.1E-29    ***

$R^2$ = 0.473 (adj. 0.469), F = 136 on 4 and 608 DF



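$$k(t) = e^{-0.771} \cdot \phi_a^{0.326} \cdot \phi_v^{0.0814} \cdot \phi_r^{0.0395} \cdot \tau^{0.573} \cdot e^{\epsilon'}$$

$$= 0.463 \cdot \phi_a^{0.326} \cdot \phi_v^{0.0814} \cdot \phi_r^{0.0395} \cdot \tau^{0.573} \cdot e^{\epsilon'} \tag{13}$$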
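
To make the fitted paper model concrete, the sketch below evaluates it with the point estimates from Table 1 and leaves out the residual term. The input values for the prior impacts and the time factor are hypothetical.

```python
import math

# Point estimates from Table 1 (log-linear coefficients).
ALPHA_PRIME = -0.771
GAMMA = {"author": 0.326, "venue": 0.0814, "reference": 0.0395}
BETA = 0.573


def predicted_citations(phi_a, phi_v, phi_r, tau):
    """Expected citations from the fitted paper model, ignoring the residual term."""
    return (math.exp(ALPHA_PRIME)
            * phi_a ** GAMMA["author"]
            * phi_v ** GAMMA["venue"]
            * phi_r ** GAMMA["reference"]
            * tau ** BETA)


base = predicted_citations(10, 2, 20, 3)
doubled_authors = predicted_citations(20, 2, 20, 3)
print(base, doubled_authors / base)   # the ratio equals 2 ** 0.326
```

Because the model is multiplicative, doubling authors' prior impact scales the expected citations by the same factor regardless of the other inputs, roughly a 25% increase under these estimates.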



The fitness model for papers explains nearly 50% of citation score variances in
the InfoVis data ($R^2 = 0.473$ and adjusted $R^2 = 0.469$). Given only four factors
included in the model, this is relatively high. For example, Peters and van Raan
(1994) studied fourteen determinants of citation scores in the discipline of chemical
engineering and their model explained 58% of the variance. Other models, with
an aim to boost prediction accuracy, involved a wide spectrum of content and
bibliometric factors ( Fu and 



The proposed model only takes into account external variables such as prior
fitness factors and time. Without analysis of inherent characteristics such as paper
content and scientific merit, the nearly 0.5 coefficient of determination is
considerably large. This supports the assertion that citation growth is indeed a cumulative
advantage process, in which success extensively breeds success (Price 1976).
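
The relation between the raw and adjusted coefficients of determination can be checked with the standard adjustment formula. The sketch below assumes 613 observations (the number of papers in the final data set) and four predictors, which matches the 608 residual degrees of freedom reported for the paper model.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Papers model: 613 papers, 4 predictors, reported R^2 = 0.473.
print(adjusted_r2(0.473, 613, 4))
```

The result is about 0.47, in line with the reported adjusted value once the rounding of the raw coefficient is taken into account.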



Fitness model for scholars 

Modeling scholar fitness in the InfoVis data based on Equation 12 produces the
estimates in Table 2. Authors' prior impact factor $\bar{\phi}_a$ (geometric mean), time factor $\bar{\tau}$
(geometric mean), and the number of papers p all have significant impacts on the
citation outcome. While venues' prior impacts $\bar{\phi}_v$ and references' prior impacts
$\bar{\phi}_r$ do not show significant influences on citation outcomes in the data, we reason
they contribute positively to citations and their non-significance is likely due to
their association with other factors such as prior author impact $\bar{\phi}_a$.

Given estimates in Table 2, the fitness model for scholars can be expressed as
Equation 14. Citation growth over time follows the rough functional form of $\bar{\tau}^{0.4}$,
to which the scale-free model remains a fine approximation (Barabasi and Albert
1999). Apparently, the number of papers a scholar authored, p, has a great influence
on the citation outcome. The relation between scholar citation frequency $k_s$ and
p is $k_s \propto p^{0.8}$, close to a linear function.



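$$k_s(t) = e^{-0.735} \cdot \bar{\phi}_a^{0.217} \cdot \bar{\phi}_v^{0.0453} \cdot \bar{\phi}_r^{0.0159} \cdot \bar{\tau}^{0.395} \cdot p^{0.786} \cdot e^{\epsilon'}$$

$$= 0.479 \cdot \bar{\phi}_a^{0.217} \cdot \bar{\phi}_v^{0.0453} \cdot \bar{\phi}_r^{0.0159} \cdot \bar{\tau}^{0.395} \cdot p^{0.786} \cdot e^{\epsilon'} \tag{14}$$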

Table 2 Fitness model for scholars in the InfoVis data

Coefficient                                    Estimate   Std Error   t value   Pr(>|t|)
$\alpha'$                                      -0.735     0.0845      -8.7      1.3E-17    ***
$\gamma_a$: author impact $\bar{\phi}_a$       0.217      0.0154      14.1      1.6E-41
$\gamma_v$: venue impact $\bar{\phi}_v$        0.0453     0.0443      1.02      0.31
$\gamma_r$: reference impact $\bar{\phi}_r$    0.0159     0.00842     1.88      0.06
$\beta$: time factor $\bar{\tau}$              0.395      0.0303      13        4E-36      ***
$\kappa$: # authored papers $p$                0.786      0.0276      28.5      4.5E-132

$R^2$ = 0.629 (adj. 0.627), F = 349 on 5 and 1030 DF



Again, the analysis indicates significant impacts of preferential attachment in
citation growth as a cumulative advantage process (Price 1976; Barabasi and Albert
1999). The large $R^2 > 0.6$ from modeling InfoVis scholars suggests that there is
an extensive rich-get-richer and fitter-get-richer effect. Scholarly productivity and
impact evaluation based on raw citation scores is not necessarily fair given the existing
advantage of early players and bias caused by scholarly establishment. Isolating
the influences of time and prior fitness factors may lead to new insight into the
evaluation, which we will discuss later.



Validation of citation frequency distributions 

Connectivity distribution analysis has been an important tool in network science 
research. A major goal of various complex network models has been to reproduce 
important patterns/characteristics in these distributions. Here we use the two
models discussed above to generate citation distributions for papers as well as for
scholars. Figures 4 (a) and (b) show cumulative frequency distributions for the two
models respectively and compare their predicted results to observed distributions
in the data.



Fig. 4 Cumulative citation score distributions for (a) papers and (b) scholars, each comparing
the observed distribution with the predicted distribution. Both figures are on log/log coordinates.
(a) depicts observed and predicted citation distributions of papers; (b) shows observed and predicted
citation distributions of scholars. Data points are weighted proportionally to their citation
scores in the fitness models plotted here.


In general, distributions generated by the proposed fitness models manifest 
cumulative patterns similar to those observed in the data. Predicted citation fre- 
quencies appear to be conservative estimates of observed frequencies. Overall, the 
models produce more rarely cited nodes (top-left in both figures) and fewer highly 
cited ones (bottom-right in both figures) . For highly-cited nodes that are predicted 
by the models, their citation frequencies are smaller than actual values (bottom- 
right in both figures) . Despite these local differences, predicted (model-generated) 
and observed distributions look consistent. 
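
A cumulative distribution of citation scores of the kind compared here can be computed as follows. This is a sketch under assumptions: it uses the "at least this many citations" convention, which yields a curve that falls from top-left to bottom-right, and the sample scores are hypothetical.

```python
from collections import Counter


def cumulative_distribution(scores):
    """Return (score, fraction of nodes with citation score >= score) pairs."""
    counts = Counter(scores)
    total = len(scores)
    remaining = total
    points = []
    for score in sorted(counts):
        points.append((score, remaining / total))
        remaining -= counts[score]
    return points


observed = [1, 1, 1, 2, 2, 3, 5, 8, 20, 70]   # hypothetical citation scores
for score, fraction in cumulative_distribution(observed):
    print(score, fraction)
```

Applying the same function to model-predicted scores and plotting both point sets on log/log axes gives a comparison of the type described above.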

Top papers and scholars in InfoVis 



Table 3 Top-ranked papers by citation scores k in the data. Column $k_t$ has citation scores
normalized by time factor $\tau$, that is $k_t = k/\tau^{\beta}$. Column $k_{tf}$ denotes citation scores normalized
by time factor $\tau$ and $\phi$ variables, that is $k_{tf} = k/(\tau^{\beta} \prod_n \phi_n^{\gamma_n})$. Column $k_{acm}$ is the number
of citations identified for each paper in the ACM DL in 2012.

No.  Title                                        Year   k     k_t     k_tf    k_acm
1    Cone Trees: animated 3D...                   1991   70    33.9    6.4     300
2    The perspective wall: detail and...          1991   30    14.8    2.7     200
3    Visual information seeking: tight...         1994   29    16.4    3.2     230
4    Information visualization using 3D...        1993   28    15.1    2.5     142
5    Tree-Maps a space-filling approach to        1991   28    13.8    2.8     212
6    The table lens: merging graphical and        1994   24    13.7    2.7     151
7    Pad++: a zooming graphical interface...      1994   23    13.1    4.5     202
8    Pad: an alternative approach to the...       1993   22    12.0    6.7     151
9    Stretching the rubber sheet: a...            1993   22    12.0    3.7     79
10   Dynamic queries for information...           1992   19    9.9     2.1     127
11   A review and taxonomy of distortion-or       1994   19    10.9    3.8     120
12   Tree visualization with tree-maps: 2-d       1992   18    9.5     2.3     215
13   Graphical Fisheye Views                      1993   16    8.9     2.6     122
14   Toolglass and magic lenses: the see-th       1993   15    8.3     3.3     346
15   Parallel coordinates a tool for...           1990   15    7.3     2.8     115
16   The movable filter as a user...              1994   13    7.7     2.5     72
17   Worlds within worlds: metaphors for...       1990   13    6.1     2.7     51
18   The dynamic HomeFinder: evaluating...        1992   12    6.5     1.5     75
19   Interactive graphic design using...          1994   12    7.1     2.0     19
20   To see, or not to see- is That the...        1991   10    5.2     3.0     24

     Correlation with k_acm:                             0.586 0.593   0.394



With treatments of time and fitness factors, the proposed model has the potential
to single out these variables and to identify an individual node's ability
for long-term growth. The InfoVis data were prepared in 2004 to document the
birth and early history of the field. Now that many years have passed, there is new
evidence about how well nodes (papers and scholars) have grown in citations. We
use the total number of citations identified for each paper or scholar in the ACM
Digital Library in 2012, denoted as $k_{acm}$, as a surrogate for its long-term impact.

We sort papers by their overall citation scores $k$ within the InfoVis 2004 data
and select the top 20 for Table 3. We reason that removing the time factor $\tau$ from
citation scores, among others, supports a fairer comparison of papers published in
various years. This leads to weighted citation scores based on time normalization,
$k_t = k/\tau^{\beta}$, and on further removal of fitness factors,
$k_{tf} = k/(\tau^{\beta} \prod_n \phi_n^{\alpha_n})$ (see the additional $k_t$ and
$k_{tf}$ columns in Table 3). While $k$ and $k_t$ are in general consistent
with the $k_{acm}$ outcome, $k_t$ has a slightly higher correlation with $k_{acm}$
(0.593 versus 0.586 for $k$), whereas $k_{tf}$, with removal of both time and prior
fitness factors, has a weaker correlation (0.394).
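The two normalizations can be illustrated with a minimal Python sketch. The function names, the exponent notation `alphas`, and all input values below are assumptions chosen for illustration; they are not the estimates obtained from the InfoVis data, so the outputs are not expected to reproduce the values in Table 3.

```python
from math import prod


def time_normalized(k, tau, beta):
    """Return k_t = k / tau**beta."""
    return k / tau ** beta


def time_fitness_normalized(k, tau, beta, phis, alphas):
    """Return k_tf = k / (tau**beta * prod_n phi_n**alpha_n)."""
    fitness = prod(phi ** alpha for phi, alpha in zip(phis, alphas))
    return k / (tau ** beta * fitness)


# Hypothetical inputs (NOT the fitted parameters of the paper).
k = 70               # raw citation score
tau = 4.0            # time factor
beta = 0.5           # time exponent
phis = [2.0, 3.0]    # prior fitness variables
alphas = [1.0, 0.5]  # exponents of the fitness variables

k_t = time_normalized(k, tau, beta)                          # 70 / 2
k_tf = time_fitness_normalized(k, tau, beta, phis, alphas)   # 70 / (2 * 2 * sqrt(3))
print(round(k_t, 1), round(k_tf, 1))
```

Because every factor in the denominator is larger than 1 in this example, each normalization step lowers the score, which mirrors the ordering of the $k$, $k_t$, and $k_{tf}$ columns for most rows of the table.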

Table 4 Top-ranked scholars by citation scores $k_s$ in the data. Column $k_t$ has citation scores
normalized by time factor $\tau$, that is $k_t = k/\bar{\tau}^{\beta}$; whereas $k_{tf}$ denotes citation scores
normalized by time factor $\tau$ and $\phi$ variables, that is
$k_{tf} = k/(\bar{\tau}^{\beta} \prod_n \phi_n^{\alpha_n})$. Column $k_{acm}$ is the number
of citations identified for each scholar in the ACM DL in 2012.



No.  Scholar name (mean pub year)     k_s     k_t    k_tf   k_acm
---  ----------------------------  ------  ------  ------  ------
  1  B. Shneiderman (1995)             94    82.5    31.7    5389
  2  J. D. Mackinlay (1995)            87    77.2    28.2    1736
  3  S. K. Card (1995)                 82    70.3    24.3    3547
  4  G. W. Furnas (1994)               81    69.2    25.7    1595
  5  G. Robertson (1994)               60    50.9    17.3    2177
  6  E. R. Tufte (1988)                58    40.3    18.7     884
  7  C. Ahlberg (1994)                 33    27.8    10.6     535
  8  R. Rao (1995)                     30    27.5    11.1     271
  9  W. S. Cleveland (1988)            27    19.7     9.6     311
 10  T. Munzner (1997)                 22    21.9    11.8     318
 11  B. Bederson (1998)                22    23.2    10.4    2094
 12  S. K. Feiner (1992)               20    16.8     9.4    2656
 13  P. Pirolli (1996)                 18    17.6     6.8    1396
 14  S. G. Eick (1997)                 18    17.5     8.6     736
 15  B. Johnson (1991)                 16    13.1     5.5     241
 16  S. F. Roth (1995)                 16    15.0     7.1     471
 17  J. D. Hollan (1995)               14    13.6     6.6     967
 18  M. H. Brown (1993)                14    12.1     6.1     622
 19  M. Chalmers (1997)                13    13.7     8.1     594
 20  J. Lamping (1994)                 13    12.0     1.9     649

     Correlation with k_acm:        0.685   0.710   0.707





Based on the fitness model for scholars, we perform a similar analysis of
these variables, namely: 1) the raw citations of a scholar in the data, $k_s$; 2) time
normalized citations, $k_t = k/\bar{\tau}^{\beta}$; 3) time and prior fitness normalized citations,
$k_{tf} = k/(\bar{\tau}^{\beta} \prod_n \phi_n^{\alpha_n})$; and 4) citations in the ACM DL by 2012,
$k_{acm}$. As shown in Table 4, both $k_t$ and $k_{tf}$ have stronger correlations with the
long-term citation growth $k_{acm}$ (0.710 and 0.707, respectively) than the raw citation
score $k_s$ does (0.685). Note that the normalized score $k_{tf}$ represents
unknown fitness factors in the analysis and may be further factorized in
future studies. No significance test has been performed on this specific analysis
given the small samples.
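A comparison of this kind can be sketched in Python as follows. The text does not state which correlation coefficient was used; Pearson's r is assumed here. The data are only the first six rows of Table 4, so the resulting coefficients are not expected to match the values reported for all 20 scholars.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# First six rows of Table 4 (a subset, for illustration only).
k_s = [94, 87, 82, 81, 60, 58]
k_t = [82.5, 77.2, 70.3, 69.2, 50.9, 40.3]
k_tf = [31.7, 28.2, 24.3, 25.7, 17.3, 18.7]
k_acm = [5389, 1736, 3547, 1595, 2177, 884]

# Correlate each score column with the long-term benchmark k_acm.
columns = {"k_s": k_s, "k_t": k_t, "k_tf": k_tf}
correlations = {name: pearson(col, k_acm) for name, col in columns.items()}

for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

With samples this small, differences between the coefficients should be read with caution, in line with the absence of a significance test noted above.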

Paper fitness over time 

The InfoVis data are about the birth and growth of the scientific field of information
visualization. Citation records in the data, though not an exhaustive collection
about the field, are generally considered representative of its development.
Plotting raw citation scores of papers in the data over time, however, shows a
counter-intuitive picture. In Figure 5(a), the average citation score $k$ peaks in the late
1980s and early 1990s and decreases continuously afterwards. While we understand
that this trend is mainly due to the lack of citation data for recent publications,
tracking raw citation scores for analyzing the development of a field does not offer
much insight. Note that there is only one paper published in 2004 in the collection,
which has no citations and is not included in the plots.




[Figure 5: three panels, each plotted against publication year t (axis ticks at 1975, 1985, 1995).
(a) citations $k$; (b) $k$ normalized by $\tau$; (c) $k$ normalized by $\tau$ and $\phi$.]

Fig. 5 Citation scores over time. Each data point represents the citation score y of a paper
published in year x. The solid line (curve) indicates yearly averages. Vertical lines show the
9-year period of the InfoVis conference from 1995 - 2003. In all panels, the x axis is linear
and the y axis is logarithmic. Panel (a) shows the yearly distribution of raw citation scores
1974 - 2004. Panel (b) is the yearly distribution of citation scores normalized by time factor
$\tau$, that is $k_t = k/\tau^{\beta}$. Panel (c) is the yearly distribution normalized by time factor $\tau$ and $\phi$
variables, that is $k_{tf} = k/(\tau^{\beta} \prod_n \phi_n^{\alpha_n})$.



Plots of the normalized scores $k_t$ and $k_{tf}$ over time, shown in Figure 5
(b) and (c) respectively, tell a different story. Time normalized $k_t$ also peaks
in the late 1980s but stays roughly constant after the early 1990s. With
$k_{tf}$ normalized by time and prior fitness factors, there appears to be an overall
incremental development over the years, especially during 1995 - 2003. According
to Fekete et al. (2004), 1995 marked an important milestone of the field, when the
first IEEE Symposium on Information Visualization was held. The InfoVis data
in this analysis include 9 years of proceedings of the conference, from 1995 to 2003. It
is quite certain that the field experienced healthy growth during this period, and
the $k_{tf}$ plot in Figure 5(c) is relatively consistent with this observation.
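The yearly averages drawn as solid lines in Figure 5 amount to a group-by-year mean of paper scores. A minimal sketch is given below; the function name `yearly_mean` is hypothetical, and the sample records are (year, $k_{tf}$) pairs taken from a few rows of Table 3, which covers only top-ranked papers and therefore does not represent the full yearly distribution.

```python
from collections import defaultdict


def yearly_mean(records):
    """Average score per publication year.

    records: iterable of (year, score) pairs.
    Returns a dict mapping year to mean score, ordered by year.
    """
    groups = defaultdict(list)
    for year, score in records:
        groups[year].append(score)
    return {year: sum(scores) / len(scores)
            for year, scores in sorted(groups.items())}


# (year, k_tf) pairs from six rows of Table 3, for illustration only.
records = [
    (1991, 6.4), (1991, 2.7), (1994, 3.2),
    (1993, 2.5), (1991, 2.8), (1994, 2.7),
]

means = yearly_mean(records)
for year, mean_score in means.items():
    print(year, round(mean_score, 2))
```

The same function applies unchanged to raw scores $k$ or time normalized scores $k_t$; only the second element of each record differs.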



Summary of model validation and analysis 



The modeling and validation rely on generalized linear regression analyses, citation
frequency distributions, comparison with long-term citation evidence, and
examination of (normalized) citation frequencies over time. We have found the
proposed fitness model to be a useful tool for scholarly impact evaluation, which
offers insight consistent with observations about citation growth in general and
the InfoVis field in particular. The models produce significant results with the InfoVis
data and regenerate citation distribution patterns similar to those in the data.
The fitness models, involving time and prior fitness factors, explain a significant
portion of citation variances ($R^2 \approx 50\%$ in the model for papers and $R^2 \approx 60\%$ for
scholars). Isolation of these factors offers a good estimate of nodes' (papers'
and scholars') ability to gain citations in the long term. Normalization of citation
scores by time and prior fitness factors also leads to a more reasonable depiction
of InfoVis development in its recent history.



Additional data analysis 
Paper fitness distributions 

In Figure [H we looked at cumulative distributions of raw citation scores. Figure [6] 
(a) shows the discrete, non-cumulative distribution of paper citation frequencies,
which roughly follows a power-law function, appearing linear on the log-log plot. Figure 6(b)
plots time normalized $k_t$, whereas (c) shows the distribution of $k_{tf}$ with both time
and prior fitness normalization.



[Figure 6: three panels. (a) citations $k$ (x axis: k); (b) $k$ normalized by $\tau$ (x axis: k/t);
(c) $k$ normalized by $\tau$ and $\phi$ (x axis: k/tf). Fit annotations shown in the plots:
intercept = 226, slope = -1.5; exponential fit: intercept = 612, slope = -1.8.]

Fig. 6 Paper citation score distributions. Panel (a) shows the distribution of raw citation
scores. Panel (b) is the distribution of citation scores normalized by time factor $\tau$, that is
$k_t = k/\tau^{\beta}$. Panel (c) is the distribution normalized by time factor $\tau$ and $\phi$ variables, that is
$k_{tf} = k/(\tau^{\beta} \prod_n \phi_n^{\alpha_n})$. In panels (a) and (b), both the x and y axes are logarithmic, whereas
in panel (c) only the y axis is log-transformed.



While the $k_t$ distribution remains roughly linear on the log-log plot (see Figure 6(b)),
the $k_{tf}$ distribution resembles an exponential form (see Figure 6(c)). The $k_{tf}$
normalization, with the removal of contributions from time and variables related
to prior fitness, is essentially an unknown factor about a paper's additional ability
to gain citations. It likely represents constituent variables such as
a paper's quality, scientific merit, and potential contribution to the field. In this
sense, $k_{tf}$ can be seen as a measure of a paper's inherent fitness, whereas prior
fitness $\phi$ and time $\tau$ are external factors. The exponential distribution suggests a
single- or broad-scale nature of $k_{tf}$. That is, there are certain constraints on how
related inherent factors may vary, leading to a scale limit (Amaral et al. 2000a).
Identification of additional variables in the data and further factorization of $k_{tf}$ may
lead to the discovery of important characteristics of this distribution and a better
understanding of its implications for citation analysis.
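The distinction between the two forms can be checked numerically: a power law is a straight line when both axes are log-transformed, whereas an exponential is a straight line when only the y axis is. The Python sketch below fits both lines by ordinary least squares and compares their $R^2$ values. It is an illustration under stated assumptions: the helper names are hypothetical, the counts are synthetic and generated from arbitrary parameters, and the fitting procedure is not necessarily the one used for Figure 6.

```python
from math import exp, log


def linear_fit(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot


def compare_forms(scores, counts):
    """Fit log(count) against log(score) and against score itself."""
    log_counts = [log(c) for c in counts]
    _, slope, r2_power = linear_fit([log(s) for s in scores], log_counts)
    _, rate, r2_exp = linear_fit(list(scores), log_counts)
    return {"power_law": (slope, r2_power), "exponential": (rate, r2_exp)}


# Synthetic frequency counts (arbitrary parameters, for illustration only).
scores = [1, 2, 3, 4, 5]
power_counts = [200 * s ** -1.5 for s in scores]
exp_counts = [600 * exp(-1.8 * s) for s in scores]

print(compare_forms(scores, power_counts))
print(compare_forms(scores, exp_counts))
```

For each synthetic series, the form that generated it yields the $R^2$ closer to 1, which is the numerical counterpart of judging linearity by eye on log-log versus semi-log axes.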

Impact of multi-authorship 

The average number of authors per paper, as shown in Figure 7, generally
increases over the 31 years covered by the InfoVis data. Earlier analysis of the field has
shown that collaboration was a key factor in the development of InfoVis given its
multidisciplinary nature (Börner et al. 2005). Analyzing multi-authorship contributions
will help us understand collaboration trends and impacts.




Fig. 7 The number of authors per paper over time (x axis: year). Data points represent
individual papers. The solid line indicates yearly averages.



Figure 8 plots paper citation scores over the number of authors. It is interesting
that several papers with seven authors have high citation scores (see the peaks in
Figures 8(a), (b), and (c)). We argue that these are peculiar to the data and may not
be generalizable. If we ignore these seven-author papers as exceptions (outliers),
there is a general decreasing trend of average citation scores $k$ with an increased
number of co-authors.






[Figure 8: three panels, each plotted against the number of authors (x axis: # authors).
(a) citations $k$; (b) $k$ normalized by $\tau$; (c) $k$ normalized by $\tau$ and $\phi$.]

Fig. 8 Citation scores over # authors. Each data point represents the citation score y of
a paper with x authors. The solid line indicates averages for each number of authors. In all
panels, the x axis is linear and the y axis is logarithmic. Panel (a) shows the distribution of raw
citation scores for papers from 1974 - 2004. Panel (b) is the distribution of citation scores normalized
by time factor $\tau$, that is $k_t = k/\tau^{\beta}$. Panel (c) is the distribution normalized by time
factor $\tau$ and $\phi$ variables, that is $k_{tf} = k/(\tau^{\beta} \prod_n \phi_n^{\alpha_n})$.



With normalized citation scores $k_t$ and $k_{tf}$, shown in Figure 8(b) and (c), the
impact of co-author team size becomes ambiguous. While it is difficult to reach
any conclusion about large collaboration teams given the lack of data, many
high-impact papers were written by one to three co-authors (see individual data points at
the top of the Figure 8 plots). Previous studies have offered evidence of productive
collaboration in small teams in InfoVis (Ke et al. 2004; Börner et al. 2005). For
example, the strong collaboration of three key researchers, J. D. Mackinlay, S. K. Card,
and G. Robertson, all among the top scholars listed in Table 4, produced several milestone
works on information visualization and has been highly regarded in the field.



Conclusion 

We propose a model to analyze citation growth and influences of fitness variables.
Taking into account time factor $\tau$ and prior fitness factors $\phi$, the model offers not
only a new formula to predict growing citations but also an approach to quantifying
influences of these factors (bias) in scholarly impact analysis.

Applying the proposed method to modeling paper and scholar citations in the
InfoVis 2004 data, a benchmark collection documenting the birth and 31 years' history
of information visualization, leads to findings consistent with citation growth
in general and our observation about the domain in particular. The $\tau$ and $\phi$
variables have been found to have significant influences on paper citation scores,
and the overall effect size is considerably large, with $R^2 \approx 0.5$ for the paper fitness
model and $R^2 > 0.6$ for the derived scholar fitness model. Citation growth over
time follows a power function close to that identified in the scale-free model, in
which citation score $k \propto t^{\beta}$ with $\beta = 1/2$ (Barabási and Albert 1999).

Distribution analysis and normalization of citation frequencies based on model 
estimates provide insights consistent with observations about the domain. Both 
paper and scholar fitness models reproduce citation frequency distributions that 
roughly match observed distributions. Isolating the impact of time $\tau$ from raw
citation scores produces normalized scores better correlated with a long-term citation
benchmark. While plotting raw citation scores over 30 years of InfoVis
seems to suggest a counter-intuitive story about the field, normalizing the scores 
by influences of time and prior fitness reveals a trend consistent with our general 
understanding of the field. 

Overall, the analysis demonstrates the ability of the proposed model to produce 
results consistent with the data and to support meaningful comparison of citation 
scores. The model is based on the general reasoning behind preferential attachment 
and fitness in evolving, growing networks. The simplicity of the proposed fitness 
modeling, which relies on nothing more than citation records, enables straightfor- 
ward replication of the reported analysis. We plan to apply the model to analyzing 
other scientific domains in future studies. 